From first principles to the Burrows and Wheeler transform and beyond, via combinatorial optimization
Authors
Abstract
We introduce a combinatorial optimization framework that naturally yields a class of optimal word permutations. Our framework provides the first formal quantification of the intuitive idea that the longer the context shared by two symbols in a word, the closer those symbols should be to each other in a linear order of the symbols. The Burrows and Wheeler transform [6], and the compressible part of its analog for labelled trees [10], are special cases in the class. We also show that the class of optimal word permutations defined here is identical to the one identified by Ferragina et al. for compression boosting [9]. Therefore, they are all highly compressible. We also investigate more general classes of optimal word permutations, where relatedness of symbols may be measured by functions more complex than context length. In this case, we establish a non-trivial connection between word permutations and the Table Compression techniques presented in Buchsbaum et al. [5], on the one hand, and a universal similarity metric [17] with uses in Clustering and Classification [8], on the other. Unfortunately, for this general problem, we provide instances that are MAX-SNP hard, and therefore unlikely to be solved or approximated efficiently. The results presented here indicate that, contrary to folklore, the key feature of the Burrows and Wheeler transform seems to be the existence of efficient algorithms for its computation and inversion, rather than its compressibility. Finally, for completeness, we also provide a solution to an open problem implicitly posed in [6] regarding the computation of the transform.
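For readers unfamiliar with the transform the abstract refers to, the following is a minimal sketch of the textbook Burrows and Wheeler transform and its inversion. It is not the optimization framework of the paper: it uses the naive sorted-rotations construction, and the function names `bwt` and `inverse_bwt` and the `$` end-of-string sentinel are assumptions made for illustration only.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Naive Burrows-Wheeler transform via sorted cyclic rotations.

    The sentinel is an illustrative assumption: it must not occur in
    `text` and must sort before every other symbol.
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in the input")
    s = text + sentinel
    # Sorting rotations places symbols followed by similar contexts
    # next to each other in the last column.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column: str, sentinel: str = "$") -> str:
    """Invert the transform by repeatedly prepending and re-sorting."""
    n = len(last_column)
    table = [""] * n
    for _ in range(n):
        # Prepend the last column to every row, then sort the rows.
        table = sorted(last_column[i] + table[i] for i in range(n))
    # The original string is the row that ends with the sentinel.
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


if __name__ == "__main__":
    transformed = bwt("banana")
    print(transformed)               # annb$aa
    print(inverse_bwt(transformed))  # banana
```

In the output for `banana`, the two `n` symbols that precede `a` end up adjacent, which is a small instance of the idea that symbols sharing a context should sit close together. This sketch takes quadratic space and more than quadratic time; practical implementations rely on suffix sorting instead.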
Similar resources
Burrows-Wheeler compression: Principles and reflections
After a general description of the Burrows Wheeler Transform and a brief survey of recent work on processing its output, the paper examines the coding of the zero-runs from the MTF recoding stage, an aspect with little prior treatment. It is concluded that the original scheme proposed by Wheeler is extremely efficient and unlikely to be much improved. The paper then proposes some new interpreta...
On the combinatorics of suffix arrays
We prove several combinatorial properties of suffix arrays, including a characterization of suffix arrays through a bijection with a certain well-defined class of permutations. Our approach is based on the characterization of Burrows-Wheeler arrays given in [1], that we apply by reducing suffix sorting to cyclic shift sorting through the use of an additional sentinel symbol. We show that the ch...
Combinatorial Transforms : Applications in Lossless Image Compression
Common image compression standards are usually based on a frequency transform such as the Discrete Cosine Transform. We present a different approach for lossless image compression, which is based on a combinatorial transform. The main transform is the Burrows Wheeler Transform (BWT), which tends to reorder symbols according to their following context. It has become one of the promising compression approaches based...
Lossless and nearly-lossless image compression based on combinatorial transforms. (Compression d'images sans perte ou quasi sans perte basée sur des transformées combinatoires)
Common image compression standards are usually based on a frequency transform such as the Discrete Cosine Transform or Wavelets. We present a different approach for lossless image compression, based on a combinatorial transform. The main transform is the Burrows Wheeler Transform (BWT), which tends to reorder symbols according to their following context. It has become a promising compression approach bas...
Lightweight LCP construction for very large collections of strings
The longest common prefix array is a very advantageous data structure that, combined with the suffix array and the Burrows-Wheeler transform, allows to efficiently compute some combinatorial properties of a string useful in several applications, especially in biological contexts. Nowadays, the input data for many problems are big collections of strings, for instance the data coming from “next-g...
Journal:
- Theor. Comput. Sci.
Volume 387
Publication date: 2007